In [ ]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from codefiles.datagen import random_xy, x_plus_noise, data_3d
from codefiles.dataplot import plot_principal_components, plot_3d, plot_2d
# %matplotlib inline
%matplotlib notebook
In [ ]:
data_random = random_xy(num_points=100)
plot_2d(data_random)
Initialize PCA. Recall that we won't need any target column, since PCA is an unsupervised technique.
In [ ]:
pca_random = PCA()
Now, let's give it the random data.
In [ ]:
pca_random.fit(data_random)
And evaluate the variance along each axis. Will any axis show significantly more variance than the other?
In [ ]:
pca_random.explained_variance_
As we can see, there is not a huge difference in variance between the two axes, which is what we expected for random data. If we increase num_points in random_xy(), the two values will get even closer together.
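As a quick sanity check, we can also look at explained_variance_ratio_, which normalizes the variances so they sum to 1. This is a minimal sketch reusing the pca_random object fitted above; for random data, each component should account for roughly half of the total variance.
In [ ]:
# Each component's share of the total variance; for random data both
# values should be close to 0.5
print(pca_random.explained_variance_ratio_)

# The shares always sum to 1 by construction
print(pca_random.explained_variance_ratio_.sum())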
We'll now assess a correlated dataset. Check here for a nice GIF illustrating PCA.
In [ ]:
# Correlated data
data_correlated = x_plus_noise(slope=1)
plot_2d(data_correlated)
Initialize a new PCA and fit it on the correlated data.
In [ ]:
pca_correlated = PCA()
pca_correlated.fit(data_correlated)
In [ ]:
pca_correlated.explained_variance_
Now we can see a principal component with a significantly higher explained variance than the other one. This is knowledge we can act on, e.g., keep only one dimension if we have to (or want to), without losing much information.
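To make that concrete, here is a minimal sketch (using only standard scikit-learn calls) that projects the correlated data onto its first principal component and then reconstructs it, so we can see how little information is lost.
In [ ]:
# Keep only the first principal component
pca_1d = PCA(n_components=1)
reduced = pca_1d.fit_transform(data_correlated)   # shape (n_samples, 1)

# Map the 1-D representation back into the original 2-D space
reconstructed = pca_1d.inverse_transform(reduced)

# Average squared reconstruction error - small if one component suffices
np.mean((np.asarray(data_correlated) - reconstructed) ** 2)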
Hint: check x_plus_noise() with slope=-1.
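As a sketch of that hint, and assuming x_plus_noise() accepts slope=-1 just as it accepts slope=1 above, we can repeat the same fit and compare the explained variances. The magnitudes should look similar to the slope=1 case; what changes is the direction of the leading component.
In [ ]:
# Same experiment, but with negatively correlated data (see hint above)
data_negative = x_plus_noise(slope=-1)
plot_2d(data_negative)

pca_negative = PCA()
pca_negative.fit(data_negative)

# Variances should be comparable to the slope=1 case; the direction of the
# first component (pca_negative.components_[0]) is what flips sign
pca_negative.explained_variance_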